
Temporal-Difference Learning

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for an outcome (they bootstrap).

TD Prediction

Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. In effect, the target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called $\operatorname{TD}(0)$, or one-step TD, because it is a special case of the more general $\operatorname{TD}(\lambda)$.

Finally, note that the quantity in brackets in the $\operatorname{TD}(0)$ update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity, called the TD error, arises in various forms throughout reinforcement learning:

$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Notice that the TD error at each time is the error in the estimate made at that time. Because the TD error depends on the next state and next reward, it is not actually available until one time step later. That is, $\delta_t$ is the error in $V(S_t)$, available at time $t+1$. Also note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:

$$\begin{aligned}
G_t - V\left(S_t\right) &= R_{t+1} + \gamma G_{t+1} - V\left(S_t\right) + \gamma V\left(S_{t+1}\right) - \gamma V\left(S_{t+1}\right) \\
&= \delta_t + \gamma\left(G_{t+1} - V\left(S_{t+1}\right)\right) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2\left(G_{t+2} - V\left(S_{t+2}\right)\right) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\left(G_T - V\left(S_T\right)\right) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k
\end{aligned}$$

This identity is not exact if $V$ is updated during the episode (as it is in $\operatorname{TD}(0)$), but if the step size is small it may still hold approximately. Generalizations of this identity play an important role in the theory and algorithms of temporal-difference learning.
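To make the update concrete, here is a minimal sketch of tabular $\operatorname{TD}(0)$ prediction in Python. The Gym-style `env.reset()`/`env.step()` interface and the `policy` callable are assumptions introduced for illustration, not something specified in the text above.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation: estimate V ~ v_pi for a fixed policy.

    Assumes a Gym-style environment: env.reset() -> state,
    env.step(action) -> (next_state, reward, done, info).
    """
    V = defaultdict(float)  # value estimates, default 0 (also covers terminal states)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # action from the policy being evaluated
            next_state, reward, done, _ = env.step(action)
            target = reward + gamma * (0.0 if done else V[next_state])
            td_error = target - V[state]                 # delta_t = R_{t+1} + gamma*V(S_{t+1}) - V(S_t)
            V[state] += alpha * td_error                 # TD(0) update
            state = next_state
    return V
```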

Advantages of TD Prediction Methods

  • In practice, TD methods usually converge faster than MC methods, although this has not been formally proven.
  • TD methods do converge to the true value function $v_\pi$ for a sufficiently small constant step-size parameter, or with probability 1 if the step size decreases appropriately over time.
  • They are extremely useful for continuing tasks, which cannot be broken into episodes as MC methods require.

Sarsa: On-policy TD Control

In the previous section, we considered transitions from state to state and learned the values of states. Now we consider transitions from state-action pair to state-action pair and learn the values of state-action pairs. Formally these cases are identical: they are both Markov chains with a reward process. The theorems assuring the convergence of state values under $\operatorname{TD}(0)$ also apply to the corresponding algorithm for action values:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.
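Below is a minimal sketch of one episode of tabular Sarsa with an $\epsilon$-greedy policy. The Gym-style environment interface, the `epsilon_greedy` helper, and the representation of $Q$ as a mapping from states to arrays of action values are all assumptions made for this illustration.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Run one episode of on-policy Sarsa, updating Q in place.

    Q is assumed to map state -> np.array of action values,
    e.g. a defaultdict(lambda: np.zeros(n_actions)).
    """
    if rng is None:
        rng = np.random.default_rng()
    state = env.reset()
    action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        if done:
            target = reward                                  # Q of a terminal state is defined as zero
        else:
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            target = reward + gamma * Q[next_state][next_action]
        Q[state][action] += alpha * (target - Q[state][action])   # Sarsa update
        if not done:
            state, action = next_state, next_action
    return Q
```

Note that the next action $A_{t+1}$ is chosen by the same $\epsilon$-greedy policy that is being improved, which is what makes Sarsa on-policy.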

Q-learning: Off-policy TD Control

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

In this case, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated.
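For comparison with the Sarsa sketch above, here is a minimal sketch of one episode of tabular Q-learning. Again, the Gym-style environment interface and the tabular representation of $Q$ are assumptions made for illustration.

```python
import numpy as np

def q_learning_episode(env, Q, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Run one episode of off-policy Q-learning, updating Q in place.

    Q is assumed to map state -> np.array of action values; the env follows
    a Gym-style reset()/step() interface (an assumption, not from the text).
    """
    if rng is None:
        rng = np.random.default_rng()
    state = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy with respect to the current Q.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # Target uses the max over next actions (a greedy target policy),
        # regardless of which action the behavior policy will take next.
        target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
    return Q
```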


Expected Sarsa

Instead of updating our value function with the value-maximizing action at $S_{t+1}$ (as is the case with Q-learning) or with the action prescribed by our $\epsilon$-greedy policy (as is the case with Sarsa), we could make updates based on the expected value of $Q$ at $S_{t+1}$. This is the premise of Expected Sarsa. Doing so reduces the variance induced by selecting random actions according to an $\epsilon$-greedy policy. Its update is described by:

$$\begin{aligned}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, \mathbb{E}_\pi\!\left[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\right] - Q(S_t, A_t)\right] \\
&\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right]
\end{aligned}$$

We can adapt Expected Sarsa to be off-policy by making the target policy $\pi$ greedy, in which case Expected Sarsa becomes Q-learning. It can therefore be seen as a generalization of Q-learning that reliably improves over Sarsa.
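As a sketch of how the expectation can be computed for an $\epsilon$-greedy policy $\pi$, the helper below forms the Expected Sarsa target; the function name and the tabular representation are hypothetical, chosen to match the earlier sketches.

```python
import numpy as np

def expected_sarsa_target(Q_next, reward, gamma, epsilon):
    """Expected Sarsa target for an epsilon-greedy policy pi.

    Q_next: np.array of action values Q(S_{t+1}, .) (all zeros if terminal).
    Each action gets probability epsilon/|A|, with (1 - epsilon) extra mass on
    the greedy action, so the target averages over pi instead of sampling
    A_{t+1} (Sarsa) or maximizing over actions (Q-learning).
    """
    n_actions = len(Q_next)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q_next)] += 1.0 - epsilon
    return reward + gamma * np.dot(probs, Q_next)

# Example update inside a learning loop (alpha is the step size):
# Q[state][action] += alpha * (
#     expected_sarsa_target(Q[next_state], reward, gamma, epsilon) - Q[state][action]
# )
```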